This assignment applies the feature engineering process and the evaluation methods to a practical scenario: predicting house prices.
The dataset is provided in two separate files, one for the training set and another for the test set.
original_training_data = read.csv(file = file.path("dataset/house_price_train.csv"))
original_test_data = read.csv(file = file.path("dataset/house_price_test.csv"))
To avoid applying the feature engineering process twice (once for training and once for test), I will join both datasets (using the rbind function), apply the feature engineering, and then split the datasets again. However, joining the two data frames as they are raises an error because they do not have the same columns: test_data lacks the price column. Therefore, I first create this column in the test set and then join the data.
original_test_data$price <- 0
dataset <- as.data.table(rbind(original_training_data, original_test_data))
class(dataset)
## [1] "data.table" "data.frame"
Let’s now summarize the dataset to see where to begin.
summary(dataset)
## id date price bedrooms
## Min. :1.000e+06 6/23/2014: 142 Min. : 0 Min. : 1.000
## 1st Qu.:2.123e+09 6/25/2014: 131 1st Qu.: 220000 1st Qu.: 3.000
## Median :3.905e+09 6/26/2014: 131 Median : 384000 Median : 3.000
## Mean :4.580e+09 7/8/2014 : 127 Mean : 431877 Mean : 3.373
## 3rd Qu.:7.309e+09 4/27/2015: 126 3rd Qu.: 580000 3rd Qu.: 4.000
## Max. :9.900e+09 3/25/2015: 123 Max. :7700000 Max. :33.000
## (Other) :20817
## bathrooms sqft_living sqft_lot floors
## Min. :0.500 Min. : 370 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1430 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.116 Mean : 2080 Mean : 15099 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10685 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
##
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.00 Min. : 3.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.00 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.00 Median : 7.000
## Mean :0.007547 Mean :0.2343 Mean :3.41 Mean : 7.658
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.00 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.00 Max. :13.000
##
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 370 Min. : 0.0 Min. :1900 Min. : 0.00
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.00
## Median :1560 Median : 0.0 Median :1975 Median : 0.00
## Mean :1789 Mean : 291.7 Mean :1971 Mean : 84.46
## 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.00
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.00
##
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
##
## sqft_lot15
## Min. : 651
## 1st Qu.: 5100
## Median : 7620
## Mean : 12758
## 3rd Qu.: 10083
## Max. :871200
##
In this section I am going to perform some data cleaning.
The features id and date are not going to offer any advantage for prediction, so we drop them.
dataset[, c('id', 'date'):=NULL]
colnames(dataset)
## [1] "price" "bedrooms" "bathrooms" "sqft_living"
## [5] "sqft_lot" "floors" "waterfront" "view"
## [9] "condition" "grade" "sqft_above" "sqft_basement"
## [13] "yr_built" "yr_renovated" "zipcode" "lat"
## [17] "long" "sqft_living15" "sqft_lot15"
If we go back to the summary of the dataset, we can identify some numerical features that are actually categories. We have to convert them to the proper ‘class’ or ‘type’ using the as.factor (or factor) command.
dataset$condition <- factor(dataset$condition)
dataset$grade <- factor(dataset$grade)
dataset$zipcode <- factor(dataset$zipcode)
dataset$yr_built <- factor(dataset$yr_built)
dataset$yr_renovated <- factor(dataset$yr_renovated)
dataset$view <- factor(dataset$view)
#dataset$date <- as.Date.character(dataset$date, '%m/%d/%Y')
# let's now turn character columns into factors (the warning below simply means there are none left)
dataset[ ,names(dataset)[sapply(dataset, is.character)]:=lapply(.SD,as.factor),
.SDcols = names(dataset)[sapply(dataset, is.character)]]
## Warning in `[.data.table`(dataset, , `:=`(names(dataset)[sapply(dataset, :
## length(LHS)==0; no columns to delete or assign RHS to.
Let’s check if there are NAs in the dataset
#Counting columns with null values.
na.cols <- which(colSums(is.na(dataset)) > 0)
paste('There are', length(na.cols), 'columns with missing values')
## [1] "There are 0 columns with missing values"
We will now focus on numerical values. In this section I will detect numerical features which present outliers and I will clip those values.
# Classify all numeric columns
column_types <- sapply(names(dataset), function(x) {
class(dataset[[x]])
}
)
numeric_columns <- names(column_types[column_types != "factor"])
# Identify the features with outliers
outliers <- scores(dataset[,numeric_columns, with=FALSE], type="chisq", prob=0.9)
per_outliers <- sort(((colSums(outliers)/length(dataset$price))*100), decreasing = TRUE)
col_outliers_to_remove <- names(per_outliers[per_outliers > 5 & names(per_outliers) != 'lat'
& names(per_outliers) != 'long'])
# Plot the boxplots of the selected variables
for (i in col_outliers_to_remove){
boxplot(dataset[,i, with=FALSE], type="p", xlab=i)
}
# clipping the outliers
for (i in col_outliers_to_remove){
qnt <- quantile(dataset[[i]], probs=c(.25, .75))
caps <- quantile(dataset[[i]], probs=c(.05, .95))
H <- 1.5 * IQR(dataset[[i]])
dataset[[i]][dataset[[i]] < (qnt[1] - H)] <- caps[1]
dataset[[i]][dataset[[i]] > (qnt[2] + H)] <- caps[2]
}
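As a quick illustration of the capping rule used in the loop above, here it is applied to a small synthetic vector (the values are made up purely for demonstration):

```r
# Toy vector with two extreme values
x <- c(10, 12, 11, 13, 12, 11, 10, 12, 500, -200)
qnt  <- quantile(x, probs = c(.25, .75))
caps <- quantile(x, probs = c(.05, .95))
H <- 1.5 * IQR(x)
# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are replaced by the 5th/95th percentiles
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
summary(x)  # the extremes are now capped, the central values untouched
```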
# Plot the box plots again to verify that the procedure has been successful
for (i in col_outliers_to_remove){
boxplot(dataset[,i, with=FALSE], type="p", xlab=i)
}
We now need to detect skewness in the target variable. Let’s see the effect of skewness on a variable, and plot it using ggplot. The way to get rid of the skewness is to take the log of the feature’s values, which flattens the distribution.
# plot the skewness of the target variable
df <- rbind(data.frame(version="price",x=original_training_data$price),
data.frame(version="log(price+1)",x=log(original_training_data$price + 1)))
ggplot(data=df) +
facet_wrap(~version,ncol=2,scales="free_x") +
geom_histogram(aes(x=x), bins = 50)
We therefore transform the target variable by applying the log.
The same skewness observed in the target variable also affects other variables. To facilitate the application of the regression model, we are going to eliminate this skewness as well: for numeric features with excessive skewness, we perform a log transformation.
# setting a threshold to decide which numeric features get the log transformation
skewness_threshold = 1
Now, let’s compute the skewness of each feature that is not ‘factor’ nor ‘character’.
column_types <- sapply(names(dataset), function(x) {
class(dataset[[x]])
}
)
numeric_columns <- names(column_types[column_types != "factor"])
numeric_columns <- numeric_columns[numeric_columns != 'lat' & numeric_columns != 'long']
And now, with that information, we calculate the skewness of each column in our list of numeric (non-categorical) features.
# skew of each variable
skew <- sapply(numeric_columns, function(x) {
e1071::skewness(dataset[[x]], na.rm = T)
}
)
Now we apply the log to those features whose absolute skewness is above the threshold of 1 that we set.
# transform all variables above a threshold skewness.
skew <- skew[abs(skew) > skewness_threshold]
for(x in names(skew)) {
dataset[[x]] <- log(dataset[[x]] + 1)
}
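To see the effect of this transformation numerically, one can compare the skewness of a right-skewed sample before and after log(x + 1). The data here is synthetic, generated just for illustration:

```r
library(e1071)

# Synthetic, heavily right-skewed "lot size"-like sample
set.seed(1)
x <- rexp(10000, rate = 1/5000)
skewness(x)            # clearly positive (strong right skew)
skewness(log(x + 1))   # much closer to 0 after the transformation
```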
In this section I will create some new features to improve the predictive power of the dataset.
# having a look at the columns
colnames(dataset)
## [1] "price" "bedrooms" "bathrooms" "sqft_living"
## [5] "sqft_lot" "floors" "waterfront" "view"
## [9] "condition" "grade" "sqft_above" "sqft_basement"
## [13] "yr_built" "yr_renovated" "zipcode" "lat"
## [17] "long" "sqft_living15" "sqft_lot15"
# creating a total square feet feature
dataset$TotalSqFeet <- as.numeric(dataset$sqft_basement + dataset$sqft_above + dataset$sqft_lot + dataset$sqft_living)
# Creating a remodeling indicator (0 = no remodeling, 1 = remodeling).
# Note: yr_built and yr_renovated are factors here, so as.numeric compares level codes, not years.
dataset$Remod <- ifelse(as.numeric(dataset$yr_built)==as.numeric(dataset$yr_renovated), 0, 1)
dataset$Remod <- factor(dataset$Remod)
# We will create a label that aggregates into "others" those grades with less than 1% share
niche_grade<-names(which(summary(dataset$grade)/nrow(dataset)<0.01))
niche_grade
## [1] "3" "4" "12" "13"
dataset[, grade_agg:=as.factor(ifelse(grade%in%niche_grade,'others',as.character(grade)))]
summary(dataset$grade)/nrow(dataset)
## 3 4 5 6 7
## 4.630273e-05 1.250174e-03 1.120526e-02 9.436496e-02 4.155207e-01
## 8 9 10 11 12
## 2.808260e-01 1.210816e-01 5.250729e-02 1.847479e-02 4.120943e-03
## 13
## 6.019355e-04
summary(dataset$grade_agg)/nrow(dataset)
## 10 11 5 6 7 8
## 0.052507293 0.018474788 0.011205260 0.094364958 0.415520674 0.280826041
## 9 others
## 0.121081632 0.006019355
sum(summary(dataset$grade_agg)/nrow(dataset))
## [1] 1
dataset[, length(levels(grade_agg))]
## [1] 8
dataset[, length(levels(grade))]
## [1] 11
dataset[, length(levels(grade_agg))/length(levels(grade))-1] # substantial reduction in factor categories
## [1] -0.2727273
dataset[, grade_agg:=factor(grade_agg, levels=names(sort(summary(dataset$grade_agg), dec=T)))]
p<-ggplot(dataset, aes(x=grade_agg))+geom_bar(stat='count')+
theme(axis.text.x = element_text(angle=45))
p
# we drop the former grade variable
dataset <- dataset[, grade:=NULL]
# Now lets check the number of categories per factor variable
factor_variables<-names(dataset)[sapply(dataset, is.factor)]
count_factor_variables<-sapply(dataset[,factor_variables, with=F], summary)
count_factor_variables
## $view
## 0 1 2 3 4
## 19475 332 961 510 319
##
## $condition
## 1 2 3 4 5
## 29 170 14020 5677 1701
##
## $yr_built
## 2014 2006 2005 2004 2003 1977 2007 1978 1968
## 559 453 450 433 420 417 417 387 381
## 2008 1967 1979 1959 1990 1962 1954 2001 1987
## 367 350 343 334 317 312 305 305 294
## 1989 1969 1955 1988 1947 1999 1963 1976 1950
## 290 280 271 270 263 263 255 253 249
## 1966 1994 1960 1980 1998 1948 2009 1951 1984
## 249 249 248 240 239 234 230 229 229
## 1985 1958 1961 1991 1942 1953 2002 1952 2000
## 227 224 224 224 223 222 222 220 218
## 1986 1983 1993 2013 1981 1956 1957 1992 1949
## 215 212 202 201 199 198 198 198 195
## 1996 1975 1965 1926 1997 1964 1943 2012 1995
## 194 189 187 180 177 172 170 170 169
## 1925 1974 1941 1940 1972 1973 2010 1944 1924
## 165 162 161 156 149 149 143 140 139
## 1910 1970 2011 1928 1946 1918 1927 1929 1939
## 134 132 130 126 126 120 115 114 106
## 1982 1971 1920 1922 1945 1909 1906 1930 1919
## 105 104 98 95 95 94 92 90 88
## 1900 1908 1923 1912 1916 1921 1905 1911 1937
## 87 86 84 79 79 76 74 73 68
## (Other)
## 748
##
## $yr_renovated
## 0 1934 1940 1944 1945 1946 1948 1950 1951 1953 1954 1955
## 20683 1 2 1 3 2 1 2 1 3 1 3
## 1956 1957 1958 1959 1960 1962 1963 1964 1965 1967 1968 1969
## 3 3 5 1 4 2 4 5 5 2 8 4
## 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981
## 9 2 4 5 3 6 3 8 6 10 11 5
## 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
## 11 18 18 17 17 18 15 22 25 20 17 19
## 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
## 19 16 15 15 19 17 35 19 22 36 26 35
## 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
## 24 35 18 22 18 13 11 37 91 16
##
## $zipcode
## 98001 98002 98003 98004 98005 98006 98007 98008 98010 98011 98014 98019
## 361 199 280 317 168 498 141 283 100 195 124 190
## 98022 98023 98024 98027 98028 98029 98030 98031 98032 98033 98034 98038
## 233 499 80 412 283 321 256 273 125 432 545 589
## 98039 98040 98042 98045 98052 98053 98055 98056 98058 98059 98065 98070
## 50 282 547 220 574 403 268 406 455 468 308 117
## 98072 98074 98075 98077 98092 98102 98103 98105 98106 98107 98108 98109
## 273 441 359 198 351 104 602 229 335 266 186 109
## 98112 98115 98116 98117 98118 98119 98122 98125 98126 98133 98136 98144
## 269 583 330 553 507 184 290 409 354 493 263 343
## 98146 98148 98155 98166 98168 98177 98178 98188 98198 98199
## 288 57 446 254 269 255 262 136 280 317
##
## $Remod
## 0 1
## 84 21513
##
## $grade_agg
## 7 8 9 6 10 11 5 others
## 8974 6065 2615 2038 1134 399 242 130
To facilitate the data cleaning and feature engineering, we merged the train and test datasets. We now split them again to build our final model.
training_data <- dataset[which(dataset$price!=0),]
test <- dataset[which(dataset$price==0),]
We split the annotated dataset into training and validation sets for the later evaluation of our regression models.
splits <- splitdf(training_data)
training <- splits$trainset
validation <- splits$testset
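splitdf is a helper defined elsewhere in the course materials and not shown in this report. A minimal sketch of what it presumably does (the 80/20 ratio and the signature are assumptions):

```r
# Hypothetical sketch: random row split into train/test subsets
splitdf <- function(dataframe, seed = NULL, train_fraction = 0.8) {
  if (!is.null(seed)) set.seed(seed)
  index <- sample(seq_len(nrow(dataframe)),
                  size = floor(train_fraction * nrow(dataframe)))
  list(trainset = dataframe[index, ],
       testset  = dataframe[-index, ])
}
```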
We here start the Feature Selection.
Let’s try first a baseline including all the features to evaluate the impact of the feature engineering.
lm.model(training, validation, "Baseline")
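lm.model (and the mape metric it reports) are helpers defined earlier in the notebook and not shown here. A plausible sketch, assuming formula is price ~ . and that price is on the log scale; both function bodies are hypothetical reconstructions:

```r
# Hypothetical sketch of the evaluation helpers used throughout this report
mape <- function(predicted, observed) {
  # mean absolute percentage error
  mean(abs((observed - predicted) / observed)) * 100
}

lm.model <- function(training, validation, name) {
  model <- lm(price ~ ., data = training)
  pred  <- predict(model, newdata = validation)
  paste(name, "MAPE:", round(mape(pred, validation$price), 4))
}
```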
Let’s use information gain to select the most important variables. In this section I also use a cross-validation procedure to select the best threshold for my model. Since the results are better than the baseline, from now on I will use ig_training for all the models.
# applying information gain with the optimal threshold found
weights<- data.frame(information.gain(formula, training))
weights$feature <- rownames(weights)
weights[order(weights$attr_importance, decreasing = TRUE),]
## attr_importance feature
## zipcode 0.4078836296 zipcode
## grade_agg 0.2888902300 grade_agg
## lat 0.2771141814 lat
## sqft_living 0.2733019702 sqft_living
## TotalSqFeet 0.2505272291 TotalSqFeet
## sqft_living15 0.2182127778 sqft_living15
## sqft_above 0.1955334775 sqft_above
## bathrooms 0.1437739598 bathrooms
## yr_built 0.0665257095 yr_built
## sqft_lot15 0.0618703528 sqft_lot15
## bedrooms 0.0593158206 bedrooms
## floors 0.0572538468 floors
## long 0.0527523602 long
## view 0.0486812688 view
## sqft_lot 0.0485797797 sqft_lot
## sqft_basement 0.0421027350 sqft_basement
## yr_renovated 0.0184476174 yr_renovated
## condition 0.0084169301 condition
## waterfront 0.0072894744 waterfront
## Remod 0.0004652503 Remod
information_gain_features <- weights$feature[weights$attr_importance > information_gain_CV(training, validation,
int_min = 0.002,
int_max = 0.006,
steps = 0.01)]
ig_training <- training[,c(information_gain_features,"price"),with=FALSE]
ig_test <- test[,c(information_gain_features,"price"),with=FALSE]
ig_validation <- validation[,c(information_gain_features,"price"),with=FALSE]
lm.model(ig_training, ig_validation, "Information Gain")
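information_gain_CV is another helper not shown in this report; presumably it sweeps candidate importance cutoffs between int_min and int_max and returns the one whose feature subset gives the lowest validation error. A rough sketch under those assumptions (the whole body is hypothetical):

```r
library(FSelector)  # for information.gain

# Hypothetical sketch: pick the attr_importance cutoff that minimises validation error
information_gain_CV <- function(training, validation, int_min, int_max, steps) {
  weights <- information.gain(price ~ ., training)
  best <- list(threshold = int_min, err = Inf)
  for (th in seq(int_min, int_max, by = steps)) {
    feats <- rownames(weights)[weights$attr_importance > th]
    fit   <- lm(price ~ ., data = training[, c(feats, "price"), with = FALSE])
    pred  <- predict(fit, newdata = validation)
    err   <- mean(abs((validation$price - pred) / validation$price))
    if (err < best$err) best <- list(threshold = th, err = err)
  }
  best$threshold
}
```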
Let’s try the Ridge Regression
lambdas <- 10^seq(-3, 0, by = .05)
set.seed(121)
train_control_config <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
returnResamp = "all")
ridge.mod <- train(formula, data = ig_training,
method = "glmnet",
metric = "RMSE",
trControl=train_control_config,
tuneGrid = expand.grid(alpha = 0, lambda = lambdas))
Plotting the RMSE for the different lambda values, we can see the impact of this parameter in the model performance. Small values seem to work better for this dataset.
plot(ridge.mod)
Plotting the coefficients for different lambda values. As expected, the larger the lambda (lower norm), the smaller the coefficients of the features. However, as the counts at the top of the plot show, there is no feature selection; i.e., the model always considers all 225 parameters.
plot(ridge.mod$finalModel)
# evaluating the model
ridge.mod.pred <- predict(ridge.mod, ig_validation)
ridge.mod.pred[is.na(ridge.mod.pred)] <- 0
my_data <- as.data.frame(cbind(predicted=(exp(ridge.mod.pred) -1), observed=(exp(validation$price) -1)))
ridge.mod.mape <- mape(ridge.mod.pred, ig_validation$price) # (prediction, actual) order, consistent with the other models
ridge.mod.price_error <- mean(abs((exp(ridge.mod.pred) -1) - (exp(validation$price) -1)))
# plotting the results
ggplot(my_data, aes(predicted, observed)) +
geom_point() + geom_smooth(method = "glm") +
labs(x="Predicted") +
ggtitle(paste("Ridge", ' MAPE: ', format(round(ridge.mod.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(ridge.mod.price_error, 0), nsmall=0),
'€', sep='')) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma)
Rank the variables according to the importance attributed by the model.
# Print, plot variable importance
plot(varImp(ridge.mod), top = 20) # 20 most important features
Now let’s try lasso in combination with information gain. Moreover, I will use cross-validation to identify the best seed.
# applying the lasso regression with the optimal seed value found above.
lambdas <- 10^seq(-3, 0, by = .05)
set.seed(lasso_CV(training, validation,10,60,10))
train_control_config <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
returnResamp = "all")
lasso.mod <- train(formula, data = ig_training,
method = "glmnet",
metric = "RMSE",
trControl=train_control_config,
tuneGrid = expand.grid(alpha = 1, lambda = lambdas))
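lasso_CV is likewise a helper defined outside this report. Judging by the call set.seed(lasso_CV(training, validation, 10, 60, 10)), it presumably tries seeds between 10 and 60 in steps of 10 and returns the one giving the lowest cross-validated error; the sketch below is entirely a hypothetical reconstruction:

```r
# Hypothetical sketch: scan candidate seeds and keep the best-performing one
lasso_CV <- function(training, validation, from, to, by) {
  lambdas <- 10^seq(-3, 0, by = .05)
  best <- list(seed = from, rmse = Inf)
  for (s in seq(from, to, by = by)) {
    set.seed(s)
    mod <- caret::train(price ~ ., data = training, method = "glmnet",
                        metric = "RMSE",
                        trControl = caret::trainControl(method = "cv", number = 5),
                        tuneGrid = expand.grid(alpha = 1, lambda = lambdas))
    err <- min(mod$results$RMSE)
    if (err < best$rmse) best <- list(seed = s, rmse = err)
  }
  best$seed
}
```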
Plotting the RMSE for the different lambda values, we can see the impact of this parameter in the model performance. Small values seem to work better for this dataset.
plot(lasso.mod)
Plotting the coefficients for different lambda values. As expected, the larger the lambda (lower norm), the smaller the coefficients. Unlike ridge, lasso does perform feature selection: as the counts at the top of the plot show, coefficients are progressively driven to exactly zero as lambda grows.
plot(lasso.mod$finalModel)
# evaluating the model
lasso.mod.pred <- predict(lasso.mod, ig_validation)
lasso.mod.pred[is.na(lasso.mod.pred)] <- 0
my_data <- as.data.frame(cbind(predicted=(exp(lasso.mod.pred) -1), observed=(exp(validation$price) -1)))
lasso.mod.mape <- mape(lasso.mod.pred, validation$price)
lasso.mod.price_error <- mean(abs((exp(lasso.mod.pred) -1) - (exp(validation$price) -1)))
# plotting the results
ggplot(my_data, aes(predicted, observed)) +
geom_point() + geom_smooth(method = "glm") +
labs(x="Predicted") +
ggtitle(paste("lasso", ' MAPE: ', format(round(lasso.mod.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(lasso.mod.price_error, 0), nsmall=0),
' €', sep='')) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma)
Rank the variables according to the importance attributed by the model.
# Print, plot variable importance
plot(varImp(lasso.mod), top = 20) # 20 most important features
Now let’s try Random Forest. After running all the models with the default params, XGB was the best performing. For this reason I will apply the hyper-parameter tuning just for xgb.
# defining the model
rf_0<-ranger(formula=formula, data=ig_training)
rf.pred<-predict(rf_0, data = ig_validation)$predictions
# evaluation
rf.mape <- mape(rf.pred, validation$price)
rf.price_error <- mean(abs((exp(rf.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(rf.pred) -1), observed=(exp(validation$price) -1)))
# plotting the results
ggplot(my_data, aes(predicted, observed)) +
geom_point() + geom_smooth(method = "glm") +
labs(x="Predicted") +
ggtitle(paste("Random Forest", ' MAPE: ', format(round(rf.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(rf.price_error, 0), nsmall=0),
' €', sep='')) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma)
In this section we’ll use xgb. First, we have to transform the data into the form xgb expects: we create dummy variables and cast the data into a matrix.
xgb_ig_training <- caret::dummyVars(formula= ~., data = ig_training, fullRank=T,sep = "_")
xgb_ig_training <- data.table(predict(xgb_ig_training, newdata = ig_training))
xgb_ig_training <- as.matrix(xgb_ig_training[, !'price', with=F])
xgb_ig_validation <- caret::dummyVars(formula= ~., data = ig_validation, fullRank=T,sep = "_")
xgb_ig_validation <- data.table(predict(xgb_ig_validation, newdata = ig_validation))
xgb_ig_validation <- as.matrix(xgb_ig_validation[, !'price', with=F])
xgb_ig_test <- caret::dummyVars(formula= ~., data = ig_test, fullRank=T,sep = "_")
xgb_ig_test <- data.table(predict(xgb_ig_test, newdata = ig_test))
xgb_ig_test <- as.matrix(xgb_ig_test[, !'price', with=F])
Now let’s define our first model using all the default params
# defining the model
xgb_0<-xgboost(booster='gbtree',
data=xgb_ig_training,
label= training$price,
nrounds = 100,
objective='reg:linear')
## [1] train-rmse:8.797826
## [2] train-rmse:6.163718
## [3] train-rmse:4.320491
## ...
## [99] train-rmse:0.115786
## [100] train-rmse:0.115674
xgb.pred<-predict(xgb_0, newdata = xgb_ig_validation, type='response')
# evaluation
xgb.mape <- mape(xgb.pred, validation$price)
xgb.price_error <- mean(abs((exp(xgb.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(xgb.pred) -1), observed=(exp(validation$price) -1)))
# Plotting the results
ggplot(my_data, aes(predicted, observed)) +
geom_point() + geom_smooth(method = "glm") +
labs(x="Predicted") +
ggtitle(paste("XGBoost", ' MAPE: ', format(round(xgb.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(xgb.price_error, 0), nsmall=0),
' €', sep='')) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma)
To get better performance, I will use grid search to find optimal values for the model’s hyperparameters.
searchGridSubCol <- expand.grid(subsample = c(0.9, 1),
colsample_bytree = c(0.5, 0.6),
max_depth = c(8, 9),
min_child = seq(1),
eta = c(0.1, 0.2),
gamma = c(1)
)
ntrees <- 100
system.time(
rmseErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
#Extract Parameters to test
currentSubsampleRate <- parameterList[["subsample"]]
currentColsampleRate <- parameterList[["colsample_bytree"]]
currentDepth <- parameterList[["max_depth"]]
currentEta <- parameterList[["eta"]]
currentMinChild <- parameterList[["min_child"]]
currentgamma <- parameterList[["gamma"]]
xgboostModelCV <- xgb.cv(data = xgb_ig_training,
label = training$price,
nrounds = ntrees,
nfold = 5,
showsd = TRUE,
metrics = "rmse",
verbose = TRUE,
"eval_metric" = "rmse",
"objective" = "reg:linear",
"max.depth" = currentDepth,
"eta" = currentEta,
"gamma" = currentgamma,
"subsample" = currentSubsampleRate,
"colsample_bytree" = currentColsampleRate,
print_every_n = 10,
"min_child_weight" = currentMinChild,
booster = "gbtree",
early_stopping_rounds = 10)
xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
rmse <- tail(xvalidationScores$test_rmse_mean, 1)
trmse <- tail(xvalidationScores$train_rmse_mean,1)
output <- return(c(rmse, trmse, currentSubsampleRate, currentColsampleRate, currentDepth, currentEta, currentMinChild))}))
## [1] train-rmse:11.307232+0.003575 test-rmse:11.306831+0.013733
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [11] train-rmse:3.956090+0.001376 test-rmse:3.955863+0.008751
## [21] train-rmse:1.401310+0.000663 test-rmse:1.402677+0.005352
## [31] train-rmse:0.525391+0.001488 test-rmse:0.528876+0.002554
## [41] train-rmse:0.253110+0.001386 test-rmse:0.260110+0.003505
## [51] train-rmse:0.190346+0.001897 test-rmse:0.199892+0.003959
## [61] train-rmse:0.177906+0.001261 test-rmse:0.188094+0.003199
## [71] train-rmse:0.174978+0.000904 test-rmse:0.185344+0.002781
## [81] train-rmse:0.174127+0.000669 test-rmse:0.184627+0.002715
## [91] train-rmse:0.173718+0.000520 test-rmse:0.184250+0.002643
## [100] train-rmse:0.173216+0.000468 test-rmse:0.183770+0.002804
## ... (analogous cv logs repeat for each remaining hyperparameter combination)
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [11] train-rmse:1.108900+0.001710 test-rmse:1.109776+0.006723
## [21] train-rmse:0.222347+0.002502 test-rmse:0.230155+0.004941
## [31] train-rmse:0.178977+0.001568 test-rmse:0.189399+0.003077
## [41] train-rmse:0.175680+0.001386 test-rmse:0.186553+0.002764
## [51] train-rmse:0.174520+0.001817 test-rmse:0.185486+0.002434
## [61] train-rmse:0.173532+0.001698 test-rmse:0.184750+0.002457
## [71] train-rmse:0.173019+0.001860 test-rmse:0.184366+0.002361
## [81] train-rmse:0.172799+0.001803 test-rmse:0.184173+0.002472
## [91] train-rmse:0.172492+0.001750 test-rmse:0.183942+0.002552
## [100] train-rmse:0.172315+0.001708 test-rmse:0.183800+0.002525
## [1] train-rmse:10.052717+0.001533 test-rmse:10.052326+0.008017
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [11] train-rmse:1.107835+0.000756 test-rmse:1.110334+0.004745
## [21] train-rmse:0.218819+0.001301 test-rmse:0.228684+0.002670
## [31] train-rmse:0.176305+0.000710 test-rmse:0.188047+0.002909
## [41] train-rmse:0.174263+0.001311 test-rmse:0.186232+0.002891
## [51] train-rmse:0.173381+0.001637 test-rmse:0.185312+0.003193
## [61] train-rmse:0.172554+0.000784 test-rmse:0.184618+0.002670
## [71] train-rmse:0.172313+0.000807 test-rmse:0.184412+0.002834
## [81] train-rmse:0.171482+0.001151 test-rmse:0.183786+0.003130
## [91] train-rmse:0.171110+0.001107 test-rmse:0.183509+0.003227
## [100] train-rmse:0.170757+0.001151 test-rmse:0.183240+0.002905
## [1] train-rmse:10.053125+0.001213 test-rmse:10.053862+0.007412
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [11] train-rmse:1.111033+0.002788 test-rmse:1.112406+0.003934
## [21] train-rmse:0.223799+0.002406 test-rmse:0.232278+0.002338
## [31] train-rmse:0.178900+0.001828 test-rmse:0.189787+0.003166
## [41] train-rmse:0.175279+0.000961 test-rmse:0.186339+0.003881
## [51] train-rmse:0.174262+0.001139 test-rmse:0.185435+0.004002
## [61] train-rmse:0.173018+0.001337 test-rmse:0.184279+0.004317
## [71] train-rmse:0.172380+0.001342 test-rmse:0.183883+0.004295
## [81] train-rmse:0.171737+0.001261 test-rmse:0.183318+0.003670
## [91] train-rmse:0.171670+0.001162 test-rmse:0.183258+0.003741
## [100] train-rmse:0.171616+0.001156 test-rmse:0.183197+0.003741
## [1] train-rmse:10.052551+0.001162 test-rmse:10.052608+0.006796
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [11] train-rmse:1.108106+0.001303 test-rmse:1.108728+0.003983
## [21] train-rmse:0.218216+0.002153 test-rmse:0.226854+0.001800
## [31] train-rmse:0.176149+0.001751 test-rmse:0.187639+0.001228
## [41] train-rmse:0.173704+0.001528 test-rmse:0.185516+0.001598
## [51] train-rmse:0.172177+0.000804 test-rmse:0.184470+0.002161
## [61] train-rmse:0.171826+0.000875 test-rmse:0.184128+0.001980
## [71] train-rmse:0.171155+0.000866 test-rmse:0.183639+0.002174
## [81] train-rmse:0.170955+0.001063 test-rmse:0.183447+0.002096
## [91] train-rmse:0.170574+0.001136 test-rmse:0.183086+0.001824
## [100] train-rmse:0.170368+0.000983 test-rmse:0.182906+0.001878
## [1] train-rmse:10.052856+0.001827 test-rmse:10.052921+0.008920
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [11] train-rmse:1.109055+0.001774 test-rmse:1.110953+0.005484
## [21] train-rmse:0.219752+0.002847 test-rmse:0.229379+0.004159
## [31] train-rmse:0.177726+0.001581 test-rmse:0.189392+0.006378
## [41] train-rmse:0.174278+0.001516 test-rmse:0.186306+0.006723
## [51] train-rmse:0.172554+0.000868 test-rmse:0.184645+0.007007
## [61] train-rmse:0.172297+0.000703 test-rmse:0.184422+0.007188
## [71] train-rmse:0.171537+0.000750 test-rmse:0.183880+0.006892
## [81] train-rmse:0.170784+0.001085 test-rmse:0.183293+0.006652
## [91] train-rmse:0.170454+0.001235 test-rmse:0.183013+0.006539
## [100] train-rmse:0.170454+0.001235 test-rmse:0.183015+0.006539
## Stopping. Best iteration:
## [90] train-rmse:0.170454+0.001235 test-rmse:0.183011+0.006542
##
## [1] train-rmse:10.052655+0.001763 test-rmse:10.052597+0.007383
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
##
## [11] train-rmse:1.108083+0.000907 test-rmse:1.109058+0.004428
## [21] train-rmse:0.217753+0.001221 test-rmse:0.227060+0.003340
## [31] train-rmse:0.175264+0.001139 test-rmse:0.187611+0.002281
## [41] train-rmse:0.172835+0.001630 test-rmse:0.185387+0.002990
## [51] train-rmse:0.171869+0.001867 test-rmse:0.184524+0.002929
## [61] train-rmse:0.171642+0.001892 test-rmse:0.184320+0.002733
## [71] train-rmse:0.171557+0.001951 test-rmse:0.184240+0.002810
## Stopping. Best iteration:
## [62] train-rmse:0.171556+0.001951 test-rmse:0.184239+0.002810
## user system elapsed
## 1175.041 11.685 1308.516
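The CV log and timing above come from looping `xgb.cv` over a grid of hyperparameters. The loop was defined earlier in this report; the following is only a self-contained sketch of the pattern on simulated data (the grid values, data, and `rmse_grid` name here are illustrative, not the ones actually searched):

```r
library(xgboost)

set.seed(1)
# Illustrative data; the grid below is an example, not the grid used above
x <- matrix(rnorm(500 * 5), 500, 5)
y <- as.numeric(x %*% rnorm(5)) + rnorm(500)

rmse_grid <- NULL
for (depth in c(4, 6)) {
  for (eta in c(0.1, 0.3)) {
    cv <- xgb.cv(data = x, label = y, nrounds = 50, nfold = 5,
                 max_depth = depth, eta = eta,
                 early_stopping_rounds = 10, verbose = FALSE)
    best <- cv$evaluation_log[cv$best_iteration]
    # One column per hyperparameter combination, transposed into a
    # one-row-per-combination table afterwards
    rmse_grid <- cbind(rmse_grid,
                       c(best$test_rmse_mean, best$train_rmse_mean,
                         depth, eta))
  }
}
rmse_grid
```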
# Finding the best combination of hyperparameters
output <- as.data.table(t(rmseErrorsHyperparameters))
varnames <- c("TestRMSE", "TrainRMSE", "SubSampRate", "ColSampRate", "Depth", "eta", "gamma")
names(output) <- varnames
optimized_param <- output[TestRMSE == min(TestRMSE), 3:7]
Now we will define the model with the best hyperparameters.
# Defining the model ([[ ]] extracts each hyperparameter as a scalar rather
# than a one-cell data.table)
xgb_opt <- xgboost(booster = 'gbtree',
                   data = xgb_ig_training,
                   label = training$price,
                   nrounds = ntrees,
                   max.depth = optimized_param[['Depth']],
                   eta = optimized_param[['eta']],
                   subsample = optimized_param[['SubSampRate']],
                   colsample_bytree = optimized_param[['ColSampRate']],
                   min_child_weight = 1,
                   objective = 'reg:linear')
## [1] train-rmse:10.052924
## [2] train-rmse:8.045995
## [3] train-rmse:6.440963
## [4] train-rmse:5.156448
## [5] train-rmse:4.129259
## [6] train-rmse:3.307750
## [7] train-rmse:2.650942
## [8] train-rmse:2.126669
## [9] train-rmse:1.707042
## [10] train-rmse:1.372038
## [11] train-rmse:1.105457
## [12] train-rmse:0.892457
## [13] train-rmse:0.723715
## [14] train-rmse:0.590428
## [15] train-rmse:0.484095
## [16] train-rmse:0.401033
## [17] train-rmse:0.336576
## [18] train-rmse:0.287551
## [19] train-rmse:0.249214
## [20] train-rmse:0.222039
## [21] train-rmse:0.200967
## [22] train-rmse:0.185659
## [23] train-rmse:0.174319
## [24] train-rmse:0.166255
## [25] train-rmse:0.160776
## [26] train-rmse:0.156682
## [27] train-rmse:0.151849
## [28] train-rmse:0.149243
## [29] train-rmse:0.147081
## [30] train-rmse:0.144992
## [31] train-rmse:0.143502
## [32] train-rmse:0.142305
## [33] train-rmse:0.141352
## [34] train-rmse:0.139594
## [35] train-rmse:0.138257
## [36] train-rmse:0.137356
## [37] train-rmse:0.135985
## [38] train-rmse:0.133537
## [39] train-rmse:0.132951
## [40] train-rmse:0.130554
## [41] train-rmse:0.129795
## [42] train-rmse:0.128975
## [43] train-rmse:0.128113
## [44] train-rmse:0.127190
## [45] train-rmse:0.126474
## [46] train-rmse:0.125018
## [47] train-rmse:0.123243
## [48] train-rmse:0.122736
## [49] train-rmse:0.121889
## [50] train-rmse:0.121351
## [51] train-rmse:0.120466
## [52] train-rmse:0.119956
## [53] train-rmse:0.119519
## [54] train-rmse:0.119019
## [55] train-rmse:0.118506
## [56] train-rmse:0.118056
## [57] train-rmse:0.117613
## [58] train-rmse:0.116505
## [59] train-rmse:0.115565
## [60] train-rmse:0.115249
## [61] train-rmse:0.114792
## [62] train-rmse:0.114130
## [63] train-rmse:0.113488
## [64] train-rmse:0.112393
## [65] train-rmse:0.112193
## [66] train-rmse:0.111713
## [67] train-rmse:0.111415
## [68] train-rmse:0.110427
## [69] train-rmse:0.110075
## [70] train-rmse:0.109098
## [71] train-rmse:0.108790
## [72] train-rmse:0.107891
## [73] train-rmse:0.107494
## [74] train-rmse:0.107273
## [75] train-rmse:0.106934
## [76] train-rmse:0.106121
## [77] train-rmse:0.105797
## [78] train-rmse:0.105413
## [79] train-rmse:0.105112
## [80] train-rmse:0.104701
## [81] train-rmse:0.104271
## [82] train-rmse:0.103689
## [83] train-rmse:0.102949
## [84] train-rmse:0.102561
## [85] train-rmse:0.101933
## [86] train-rmse:0.101014
## [87] train-rmse:0.100553
## [88] train-rmse:0.100383
## [89] train-rmse:0.100099
## [90] train-rmse:0.099597
## [91] train-rmse:0.099309
## [92] train-rmse:0.098876
## [93] train-rmse:0.098775
## [94] train-rmse:0.098658
## [95] train-rmse:0.098342
## [96] train-rmse:0.097966
## [97] train-rmse:0.097734
## [98] train-rmse:0.097543
## [99] train-rmse:0.097206
## [100] train-rmse:0.096907
xgb.pred.opt <- predict(xgb_opt, newdata = xgb_ig_validation)
# Evaluation
xgb.mape <- mape(xgb.pred.opt, validation$price)
xgb.price_error <- mean(abs((exp(xgb.pred.opt) - 1) - (exp(validation$price) - 1)))
my_data <- data.frame(predicted = exp(xgb.pred.opt) - 1, observed = exp(validation$price) - 1)
# plotting the results
ggplot(my_data, aes(predicted, observed)) +
  geom_point() + geom_smooth(method = "glm") +
  labs(x = "Predicted", y = "Observed") +
  ggtitle(paste("XGBoost", ' MAPE: ', format(round(xgb.mape, 4), nsmall = 4),
                ' --> Price ERROR: ', format(round(xgb.price_error, 0), nsmall = 0),
                ' €', sep = '')) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)
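The `mape()` helper used for evaluation is not base R. If it is not already defined (note that `Metrics::mape()` takes its arguments in the opposite `(actual, predicted)` order), a minimal version matching the call sites in this report would be:

```r
# Minimal MAPE helper matching the (predicted, actual) call order used in
# this report; assumes no actual value is zero
mape <- function(predicted, actual) {
  mean(abs((actual - predicted) / actual))
}

mape(c(110, 90), c(100, 100))  # -> 0.1
```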
In this section we'll try to apply Regression with StepWise feature selection.
# Adapting the dataset to work with the linear model
lm_ig_training <- ig_training
# as.numeric() on a factor returns the level index; for year factors the
# levels sort chronologically, so the ordering is preserved
lm_ig_training$yr_built <- as.numeric(lm_ig_training$yr_built)
lm_ig_training$yr_renovated <- as.numeric(lm_ig_training$yr_renovated)
lm_ig_validation <- ig_validation
lm_ig_validation$yr_built <- as.numeric(lm_ig_validation$yr_built)
lm_ig_validation$yr_renovated <- as.numeric(lm_ig_validation$yr_renovated)
#### Regression with StepWise feature selection
lm_0<-stepAIC(lm(formula = formula,
data=lm_ig_training),
trace=F)
lm.pred <- predict(lm_0, newdata = lm_ig_validation)
# Evaluation
lm.mape <- mape(lm.pred, validation$price)
lm.price_error <- mean(abs((exp(lm.pred) - 1) - (exp(validation$price) - 1)))
my_data <- data.frame(predicted = exp(lm.pred) - 1, observed = exp(validation$price) - 1)
# Plotting the results
ggplot(my_data, aes(predicted, observed)) +
  geom_point() + geom_smooth(method = "glm") +
  labs(x = "Predicted", y = "Observed") +
  ggtitle(paste("StepWise", ' MAPE: ', format(round(lm.mape, 4), nsmall = 4),
                ' --> Price ERROR: ', format(round(lm.price_error, 0), nsmall = 0),
                ' €', sep = '')) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)
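For reference, `stepAIC` (from the MASS package) starts from a fitted model and repeatedly adds or drops terms to minimize AIC. A self-contained illustration on a built-in dataset (the `mtcars` model here is just for demonstration):

```r
library(MASS)

# Start from the full linear model and let stepAIC prune terms by AIC
full <- lm(mpg ~ ., data = mtcars)
best <- stepAIC(full, direction = "both", trace = FALSE)

# Only the predictors that pay for themselves in AIC survive
formula(best)
```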
In this section we'll apply Regression with regularization (Lasso, via glmnet).
# Defining the model
glmnet_cv<-cv.glmnet(x = xgb_ig_training,
nfolds = 5,
y = training[['price']],
alpha=1,
family = 'gaussian',
standardize = T)
plot(glmnet_cv)
glmnet_cv$lambda.min
## [1] 0.0001452647
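`lambda.min` minimizes the cross-validated error, but `cv.glmnet` also reports `lambda.1se`, the largest lambda within one standard error of that minimum, which trades a little CV error for a sparser model. A self-contained illustration on simulated data (the data here are illustrative, not the house prices):

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- x[, 1] - 2 * x[, 2] + rnorm(100)

cv <- cv.glmnet(x, y, alpha = 1, nfolds = 5)
# lambda.1se >= lambda.min: more regularization, fewer non-zero coefficients
c(cv$lambda.min, cv$lambda.1se)
```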
glmnet_0<-glmnet(x = xgb_ig_training,
y = training[['price']],
family = 'gaussian',
alpha=1, lambda = glmnet_cv$lambda.min)
glmnet_0$beta
## 281 x 1 sparse Matrix of class "dgCMatrix"
## s0
## bedrooms -1.479584e-03
## bathrooms 3.492611e-02
## sqft_living 9.468635e-05
## sqft_lot 7.759506e-02
## floors -3.451766e-02
## waterfront 6.222213e-01
## view_1 1.186810e-01
## view_2 1.107920e-01
## view_3 1.917645e-01
## view_4 2.860010e-01
## condition_2 -2.432024e-03
## condition_3 1.838034e-01
## condition_4 2.398081e-01
## condition_5 3.075722e-01
## sqft_above 1.303086e-04
## sqft_basement 8.405461e-03
## yr_built_1901 -1.468549e-02
## yr_built_1902 -5.093923e-02
## yr_built_1903 4.294774e-02
## yr_built_1904 4.104774e-02
## yr_built_1905 4.051628e-02
## yr_built_1906 2.845928e-02
## yr_built_1907 6.363098e-02
## yr_built_1908 -5.487049e-02
## yr_built_1909 5.907798e-02
## yr_built_1910 1.410455e-02
## yr_built_1911 8.108415e-02
## yr_built_1912 5.098234e-03
## yr_built_1913 4.978905e-02
## yr_built_1914 .
## yr_built_1915 6.058584e-02
## yr_built_1916 -1.515324e-03
## yr_built_1917 .
## yr_built_1918 -3.724505e-03
## yr_built_1919 2.136295e-02
## yr_built_1920 -1.843906e-02
## yr_built_1921 1.924526e-02
## yr_built_1922 2.309567e-02
## yr_built_1923 6.119913e-02
## yr_built_1924 2.522968e-02
## yr_built_1925 1.477596e-02
## yr_built_1926 6.855579e-03
## yr_built_1927 1.887382e-02
## yr_built_1928 4.027681e-02
## yr_built_1929 1.897827e-02
## yr_built_1930 8.829040e-03
## yr_built_1931 4.761113e-02
## yr_built_1932 8.683536e-02
## yr_built_1933 6.213484e-03
## yr_built_1934 7.460732e-03
## yr_built_1935 -1.135776e-01
## yr_built_1936 -1.159552e-02
## yr_built_1937 2.998888e-02
## yr_built_1938 2.610192e-02
## yr_built_1939 2.424992e-02
## yr_built_1940 8.259341e-04
## yr_built_1941 1.760638e-02
## yr_built_1942 -9.915578e-03
## yr_built_1943 -5.453822e-02
## yr_built_1944 -2.079697e-02
## yr_built_1945 -6.168704e-02
## yr_built_1946 1.204894e-02
## yr_built_1947 -3.735979e-02
## yr_built_1948 -6.895406e-03
## yr_built_1949 .
## yr_built_1950 -2.480758e-02
## yr_built_1951 -4.851633e-02
## yr_built_1952 -1.605810e-02
## yr_built_1953 -4.283819e-02
## yr_built_1954 -6.081894e-02
## yr_built_1955 -6.173728e-02
## yr_built_1956 -2.021930e-02
## yr_built_1957 -4.418181e-02
## yr_built_1958 -5.109200e-02
## yr_built_1959 -5.536560e-02
## yr_built_1960 -6.986957e-02
## yr_built_1961 -7.916205e-02
## yr_built_1962 -7.424556e-02
## yr_built_1963 -4.488380e-02
## yr_built_1964 -9.625955e-02
## yr_built_1965 -7.459859e-02
## yr_built_1966 -9.407850e-02
## yr_built_1967 -6.071106e-02
## yr_built_1968 -4.571427e-02
## yr_built_1969 -4.517778e-02
## yr_built_1970 -4.828831e-02
## yr_built_1971 -5.571726e-02
## yr_built_1972 -5.614270e-02
## yr_built_1973 -3.447103e-02
## yr_built_1974 -5.961803e-02
## yr_built_1975 -6.862180e-02
## yr_built_1976 -4.299703e-02
## yr_built_1977 -6.210875e-02
## yr_built_1978 -8.989559e-02
## yr_built_1979 -8.055785e-02
## yr_built_1980 -4.233637e-02
## yr_built_1981 -1.230592e-02
## yr_built_1982 -2.255233e-02
## yr_built_1983 .
## yr_built_1984 -2.461431e-02
## yr_built_1985 -1.569808e-02
## yr_built_1986 -4.954929e-02
## yr_built_1987 -3.471095e-02
## yr_built_1988 -3.726636e-02
## yr_built_1989 -1.261129e-02
## yr_built_1990 -3.307039e-02
## yr_built_1991 -1.088332e-02
## yr_built_1992 3.377267e-03
## yr_built_1993 2.696437e-02
## yr_built_1994 -1.313832e-02
## yr_built_1995 5.191167e-03
## yr_built_1996 2.377057e-02
## yr_built_1997 -5.698978e-03
## yr_built_1998 -5.068217e-03
## yr_built_1999 2.879884e-02
## yr_built_2000 3.452288e-02
## yr_built_2001 4.662907e-02
## yr_built_2002 2.686386e-02
## yr_built_2003 1.789817e-02
## yr_built_2004 2.549285e-02
## yr_built_2005 .
## yr_built_2006 2.061399e-02
## yr_built_2007 3.156781e-03
## yr_built_2008 3.541296e-02
## yr_built_2009 6.133105e-02
## yr_built_2010 7.146815e-02
## yr_built_2011 7.666879e-02
## yr_built_2012 9.991995e-02
## yr_built_2013 9.921345e-02
## yr_built_2014 1.051839e-01
## yr_built_2015 1.838133e-01
## yr_renovated_1934 1.383457e-01
## yr_renovated_1940 -2.177388e-01
## yr_renovated_1944 .
## yr_renovated_1945 4.123770e-02
## yr_renovated_1946 2.402046e-01
## yr_renovated_1948 -1.502751e-02
## yr_renovated_1950 -2.505284e-01
## yr_renovated_1951 .
## yr_renovated_1953 -1.095507e-01
## yr_renovated_1954 4.067528e-01
## yr_renovated_1955 -8.110347e-02
## yr_renovated_1956 -1.914145e-01
## yr_renovated_1957 -2.722276e-02
## yr_renovated_1958 2.159922e-02
## yr_renovated_1959 -2.385730e-01
## yr_renovated_1960 -1.890948e-01
## yr_renovated_1962 -2.716533e-01
## yr_renovated_1963 -3.283154e-01
## yr_renovated_1964 -6.907815e-03
## yr_renovated_1965 -4.605112e-02
## yr_renovated_1967 1.285162e-01
## yr_renovated_1968 -1.451889e-01
## yr_renovated_1969 -1.143295e-01
## yr_renovated_1970 -1.075913e-01
## yr_renovated_1971 .
## yr_renovated_1972 .
## yr_renovated_1973 -2.568962e-01
## yr_renovated_1974 -3.631841e-01
## yr_renovated_1975 9.353236e-02
## yr_renovated_1976 -2.497798e-01
## yr_renovated_1977 1.102054e-01
## yr_renovated_1978 2.385313e-01
## yr_renovated_1979 6.338237e-02
## yr_renovated_1980 -4.233997e-02
## yr_renovated_1981 4.605575e-02
## yr_renovated_1982 -3.267275e-02
## yr_renovated_1983 6.865881e-02
## yr_renovated_1984 .
## yr_renovated_1985 -4.184842e-02
## yr_renovated_1986 1.780231e-01
## yr_renovated_1987 -4.196133e-03
## yr_renovated_1988 5.186214e-02
## yr_renovated_1989 1.396939e-01
## yr_renovated_1990 1.935327e-02
## yr_renovated_1991 1.467447e-01
## yr_renovated_1992 -2.443194e-04
## yr_renovated_1993 1.004962e-01
## yr_renovated_1994 1.485293e-01
## yr_renovated_1995 9.070973e-03
## yr_renovated_1996 2.891881e-02
## yr_renovated_1997 -4.261392e-02
## yr_renovated_1998 7.090450e-02
## yr_renovated_1999 1.442544e-01
## yr_renovated_2000 5.223952e-02
## yr_renovated_2001 8.942797e-02
## yr_renovated_2002 3.374476e-01
## yr_renovated_2003 1.017245e-01
## yr_renovated_2004 1.781804e-01
## yr_renovated_2005 7.459856e-02
## yr_renovated_2006 7.129004e-02
## yr_renovated_2007 1.535018e-01
## yr_renovated_2008 2.092712e-01
## yr_renovated_2009 2.511497e-01
## yr_renovated_2010 1.726380e-01
## yr_renovated_2011 6.491641e-02
## yr_renovated_2012 5.840516e-02
## yr_renovated_2013 1.328875e-01
## yr_renovated_2014 1.292709e-01
## yr_renovated_2015 -2.726982e-02
## zipcode_98002 -5.155605e-02
## zipcode_98003 -2.519797e-02
## zipcode_98004 7.757979e-01
## zipcode_98005 3.990207e-01
## zipcode_98006 3.807750e-01
## zipcode_98007 3.368283e-01
## zipcode_98008 3.647714e-01
## zipcode_98010 1.948504e-01
## zipcode_98011 -7.018574e-02
## zipcode_98014 -6.876293e-02
## zipcode_98019 -1.432063e-01
## zipcode_98022 1.527692e-01
## zipcode_98023 -7.342754e-02
## zipcode_98024 1.758253e-01
## zipcode_98027 2.905419e-01
## zipcode_98028 -1.055415e-01
## zipcode_98029 3.794873e-01
## zipcode_98030 -2.746105e-02
## zipcode_98031 -3.822223e-02
## zipcode_98032 -8.767720e-02
## zipcode_98033 3.664799e-01
## zipcode_98034 8.833171e-02
## zipcode_98038 9.695804e-02
## zipcode_98039 8.700714e-01
## zipcode_98040 5.870704e-01
## zipcode_98042 -5.469124e-03
## zipcode_98045 2.133428e-01
## zipcode_98052 2.394662e-01
## zipcode_98053 1.711794e-01
## zipcode_98055 -4.745051e-02
## zipcode_98056 9.268604e-02
## zipcode_98058 1.506511e-02
## zipcode_98059 1.144108e-01
## zipcode_98065 1.976417e-01
## zipcode_98070 6.052206e-02
## zipcode_98072 -1.907643e-02
## zipcode_98074 2.173386e-01
## zipcode_98075 2.591786e-01
## zipcode_98077 -5.483669e-02
## zipcode_98092 -1.508845e-03
## zipcode_98102 5.900410e-01
## zipcode_98103 3.804228e-01
## zipcode_98105 5.312689e-01
## zipcode_98106 5.579203e-02
## zipcode_98107 4.177461e-01
## zipcode_98108 4.918535e-02
## zipcode_98109 6.101729e-01
## zipcode_98112 6.640970e-01
## zipcode_98115 3.771884e-01
## zipcode_98116 4.061558e-01
## zipcode_98117 3.444787e-01
## zipcode_98118 1.791612e-01
## zipcode_98119 5.849446e-01
## zipcode_98122 4.504682e-01
## zipcode_98125 8.844400e-02
## zipcode_98126 2.319060e-01
## zipcode_98133 -4.656203e-02
## zipcode_98136 3.537941e-01
## zipcode_98144 3.245858e-01
## zipcode_98146 -1.130365e-02
## zipcode_98148 -3.336120e-02
## zipcode_98155 -1.124332e-01
## zipcode_98166 1.079610e-01
## zipcode_98168 -1.615768e-01
## zipcode_98177 7.960884e-02
## zipcode_98178 -7.217630e-02
## zipcode_98188 -7.143033e-02
## zipcode_98198 -5.456456e-02
## zipcode_98199 4.319017e-01
## lat 1.102685e+00
## long -2.756904e-01
## sqft_living15 6.929315e-05
## sqft_lot15 -1.671884e-03
## TotalSqFeet 2.051271e-06
## grade_agg_8 7.415099e-02
## grade_agg_9 1.448697e-01
## grade_agg_6 -1.183688e-01
## grade_agg_10 2.055613e-01
## grade_agg_11 3.524207e-01
## grade_agg_5 -2.393488e-01
## grade_agg_others 4.068964e-01
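The dots in the sparse matrix above are coefficients the Lasso shrank exactly to zero. The surviving predictors can be listed programmatically; since `glmnet_0` is not reproducible here, this sketch uses simulated data with hypothetical column names:

```r
library(glmnet)

set.seed(42)
x <- matrix(rnorm(200 * 20), 200, 20)
colnames(x) <- paste0("V", 1:20)
y <- x[, 1] + 0.5 * x[, 2] + rnorm(200)

fit <- glmnet(x, y, alpha = 1, lambda = 0.1)
b <- as.matrix(fit$beta)
# Keep only the predictors with non-zero coefficients
rownames(b)[b[, 1] != 0]
```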
# as.numeric() flattens the one-column prediction matrix (whose column would
# otherwise be named "s0") into a plain vector
glmnet.pred <- as.numeric(predict(glmnet_0, newx = xgb_ig_validation))
# Evaluation
glmnet.mape <- mape(glmnet.pred, validation$price)
glmnet.price_error <- mean(abs((exp(glmnet.pred) - 1) - (exp(validation$price) - 1)))
my_data <- data.frame(predicted = exp(glmnet.pred) - 1, observed = exp(validation$price) - 1)
# Plotting the results
ggplot(my_data, aes(predicted, observed)) +
  geom_point() + geom_smooth(method = "glm") +
  labs(x = "Predicted", y = "Observed") +
  ggtitle(paste("Regression with regularization", ' MAPE: ', format(round(glmnet.mape, 4), nsmall = 4),
                ' --> Price ERROR: ', format(round(glmnet.price_error, 0), nsmall = 0),
                ' €', sep = '')) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)
We split the original training data into train and validation sets to evaluate the candidate models; to generate the final submission we should instead use all the data at our disposal.
In addition, since we applied a log transformation to the target variable, we have to use the exp function to revert it before submitting.
I compared all the models explained above; the best one is the optimized XGBoost, so it is the one used for the final prediction.
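The manual `exp(x) - 1` used below works, but R's `log1p()`/`expm1()` pair is the numerically safer way to apply and revert a `log(1 + price)` transformation, especially for values near zero:

```r
price <- 431877                  # mean price from the summary above
transformed <- log1p(price)      # log(price + 1)
recovered <- expm1(transformed)  # exp(transformed) - 1
all.equal(recovered, price)      # TRUE
```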
# Use the optimized XGBoost model for the final prediction
final.model <- xgb_opt
# Predict the prices for the test data (exp reverts the log transformation
# that we applied to the target variable)
final.pred <- as.numeric(exp(predict(final.model, xgb_ig_test)) - 1)
# Sanity check: there should be no missing predictions
final.pred[is.na(final.pred)]
## numeric(0)
hist(final.pred, main="Histogram of Predictions", xlab = "Predictions")
xgb_submission <- data.frame(Id = original_test_data$id, price = final.pred)
write.csv(xgb_submission, file = "submission1.csv", row.names = FALSE)